Validate checkpoint GPU UUID inputs early by aryanputta · Pull Request #2086 · NVIDIA/cuda-python

aryanputta · 2026-05-14T05:18:00Z

Summary

reject non-UUID gpu_mapping values in cuda.core.checkpoint before they reach driver-facing restore args
normalize malformed UUID string failures to the documented ValueError path
add source-level checkpoint helper tests that run without importing the built cuda.core package

Testing

python3 -m pytest cuda_core/tests/test_checkpoint_helpers.py --noconftest

Signed-off-by: Aryan <aryansputta@gmail.com>

copy-pr-bot · 2026-05-14T05:18:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

aryanputta · 2026-05-16T02:58:44Z

@leofang You were right on both points here.
What I changed in commit 083e411:

removed the mock-only cuda_core/tests/test_checkpoint_helpers.py
narrowed _as_cuuuid() so restore mappings now accept UUID strings only
updated the docs to say gpu_mapping should use the strings returned by Device.uuid

On your question specifically... I dont' have a strong public facing use case for accepting CUuuid objects here. The intended caller-facing path is the string UUID returned by Device.uuid, and the existing real checkpoint tests already follow that contract. For example, cuda_core/tests/test_checkpoint.py builds migration mappings from devices[i].uuid in _build_rotation_mapping() and runs the driver-backed checkpoint/restore scenarios without mocks.

aryanputta · 2026-06-08T15:07:19Z

@leofang I pushed the requested CUuuid compatibility restore in 3261c10. The metadata check is still failing because the PR has no milestone, and GitHub does not let me set one on the upstream PR. Could you set the milestone to cuda.core v1.1.0 when you get a chance?

leofang · 2026-06-08T15:59:18Z

/ok to test 3261c10

github-actions · 2026-06-08T16:32:12Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-2086/
https://nvidia.github.io/cuda-python/pr-preview/pr-2086/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-2086/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-2086/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

Signed-off-by: Aryan Putta <aryansputta@gmail.com>

aryanputta · 2026-06-08T21:07:28Z

@leofang I repushed this as 1520ee3 to retrigger CI because I do not have permission to rerun failed jobs on the upstream repo. I inspected the previous run before repushing: the failures looked unrelated to the checkpoint change (aarch64 sccache/GitHub cache 503, WSL NVML process-name UnicodeDecodeError, Windows jit_lto_fractal driver/backend mismatch, and one IPC flake). Could you approve the new SHA with /ok to test when convenient?

…uuid-validation

aryanputta · 2026-06-08T21:11:27Z

@leofang quick update: I moved this branch to current main as 41507c5 after confirming the previous WSL get_process_name failures were covered by #2118. The effective diff against current main is still only cuda_core/cuda/core/checkpoint.py. Could you approve the new SHA with /ok to test when convenient?

leofang · 2026-06-08T21:15:18Z

@aryanputta plz be patient and wait for one of us to check the CI. We can trigger re-runs of the failing jobs only, but if you merge with main we have to start from beginning and run all jobs, not just the failing one.

leofang · 2026-06-08T21:16:12Z

Also our CI is known flakey for a few things and rerun usually solves the problem.

aryanputta · 2026-06-08T21:21:43Z

@aryanputta plz be patient and wait for one of us to check the CI. We can trigger re-runs of the failing jobs only, but if you merge with main we have to start from beginning and run all jobs, not just the failing one.

Sorry about that, @leofang! I was told in the past not to force-push/rebase on shared branches to avoid disrupting reviewers, so I thought moving it to main would be the cleaner approach. I will definitely keep a note for myself to just leave it be and wait for a maintainer to trigger the specific job reruns next time.

Also, just wanted to let you know that I fixed the other PR #2074 we were working on whenever you have some time to check it out.

Thank you!

leofang · 2026-06-09T22:43:01Z

/ok to test 1ff00e5

leofang · 2026-06-09T22:44:56Z

I was told in the past not to force-push/rebase on shared branches to avoid disrupting reviewers

That is correct, for the very same reasons: When the CI pipelines are triggered, it's based on a particular snapshot (commit). Any action that results in a new commit (ex: merge with main, rebase, force-push, ...) would forfeit the opportunity of re-running.

Validate checkpoint GPU UUID inputs

c62413e

Signed-off-by: Aryan <aryansputta@gmail.com>

github-actions Bot added the cuda.core Everything related to the cuda.core module label May 14, 2026

leofang requested changes May 16, 2026

View reviewed changes

Comment thread cuda_core/tests/test_checkpoint_helpers.py Outdated

leofang reviewed May 16, 2026

View reviewed changes

Comment thread cuda_core/cuda/core/checkpoint.py

Narrow checkpoint GPU UUID restore inputs

083e411

leofang reviewed Jun 8, 2026

View reviewed changes

Comment thread cuda_core/cuda/core/checkpoint.py Outdated

leofang assigned leofang and aryanputta Jun 8, 2026

leofang added enhancement Any code-related improvements P1 Medium priority - Should do labels Jun 8, 2026

Restore checkpoint CUuuid compatibility

3261c10

leofang approved these changes Jun 8, 2026

View reviewed changes

leofang added this to the cuda.core v1.1.0 milestone Jun 8, 2026

Retry CI after infra failures

1520ee3

Signed-off-by: Aryan Putta <aryansputta@gmail.com>

Merge remote-tracking branch 'upstream/main' into fix-checkpoint-gpu-…

41507c5

…uuid-validation

Merge branch 'main' into fix-checkpoint-gpu-uuid-validation

1ff00e5

leofang enabled auto-merge (squash) June 9, 2026 22:45

leofang mentioned this pull request Jun 9, 2026

cuda.core: harden graph user-object destructor during Python shutdown #2074

Open

Conversation

aryanputta commented May 14, 2026

Summary

Testing

Uh oh!

copy-pr-bot Bot commented May 14, 2026

Uh oh!

Uh oh!

Uh oh!

aryanputta commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

aryanputta commented Jun 8, 2026

Uh oh!

leofang commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

aryanputta commented Jun 8, 2026

Uh oh!

aryanputta commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leofang commented Jun 8, 2026

Uh oh!

leofang commented Jun 8, 2026

Uh oh!

aryanputta commented Jun 8, 2026

Uh oh!

leofang commented Jun 9, 2026

Uh oh!

leofang commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aryanputta commented May 16, 2026 •

edited

Loading

aryanputta commented Jun 8, 2026 •

edited

Loading